Speedup process_uc_wildcards #193

Merged: 24 commits merged into main, Feb 23, 2024
Conversation

@SamRWest (Collaborator) commented Feb 21, 2024

Speeds up uc_wildcard processing by about 40x (8 s -> 0.18 s on the Ireland benchmark) by avoiding repeated regex matches, vectorising the lookups with a merge, and skipping some string processing.

The difference isn't obvious in the benchmark times because wildcards are a relatively small portion of the total runtime for smaller models, but it should have a significant impact on full model runs.
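
To make the idea concrete, here is a minimal sketch of the vectorised approach. It is illustrative only: the column names, the wildcard-to-regex conversion, and the function name are assumptions, not the actual code in xl2times/transforms.py.

import pandas as pd


def match_wildcards_vectorised(table: pd.DataFrame, processes: pd.Series) -> pd.DataFrame:
    # Evaluate each distinct wildcard pattern once instead of re-running the
    # regex for every row of the table.
    unique_patterns = table["pattern"].drop_duplicates()
    expanded = pd.DataFrame(
        [
            {"pattern": p, "process": proc}
            for p in unique_patterns
            for proc in processes[processes.str.fullmatch(p.replace("*", ".*"))]
        ],
        columns=["pattern", "process"],
    )
    # A single merge attaches the match results to every row, replacing a
    # per-row apply of the matcher.
    return table.merge(expanded, on="pattern", how="left")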

This is ready for a review now :)

@olejandro (Member):
Btw, if you convert it to a draft PR, the CI will run on every push as well.

def remove_positive_patterns(pattern):
    return ",".join([word[1:] for word in pattern.split(",") if word[0] == "-"])


@functools.cache
@SamRWest (Collaborator, Author):
Caches previous results instead of recalculating wildcard patterns, which are often repeated across tables. Seems to give a small (a couple of percent) speedup.

Collaborator:
Nice! Just wondering how large the cache gets for a big model like Ireland/Austimes, and whether we'd need to use @functools.lru_cache(maxsize=...) at some point instead. Perhaps something to keep in mind if memory usage is ever a problem.
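
For reference, a bounded variant would only need the decorator changed. The maxsize value below is an arbitrary illustration, not a measured recommendation:

import functools


# Keeps at most 10_000 distinct patterns; least recently used entries are
# evicted beyond that, trading a few cache misses for bounded memory.
@functools.lru_cache(maxsize=10_000)
def remove_positive_patterns(pattern: str) -> str:
    return ",".join(word[1:] for word in pattern.split(",") if word[0] == "-")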

Commit: organised imports, appeased linters
@siddharth-krishna changed the title from "Feature/wildcard speedup" to "Speedup process_uc_wildcards" on Feb 22, 2024
@siddharth-krishna (Collaborator) left a comment:
Looks great, thanks again!


# Call the conversion function directly
summary = run(parse_args(args))
Collaborator:
Hmm, thinking about it again, perhaps there is one reason to use subprocess: at least in the CI, should we check that xl2times works as expected from the command line? But on the other hand, run(parse_args(...)) is pretty much the same as the CLI invocation, and CI is probably faster without subprocess...

I'm undecided, so would love your thoughts. We can leave it as is in this PR and discuss in an issue, maybe?
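
To illustrate the trade-off, a rough sketch of the two invocation styles. The argument list is made up, and the import path of run/parse_args is not shown in the diff excerpt, so treat both as assumptions:

import subprocess
import sys

from xl2times.__main__ import parse_args, run  # assumed import path

args = ["benchmarks/DemoS_001", "--output_dir", "out"]  # illustrative arguments

# In-process call (this PR's approach): fast, but runs whichever version of
# xl2times is already imported into this interpreter.
summary = run(parse_args(args))

# Subprocess call (the previous approach): slower, but exercises the real
# command line and always picks up the currently checked-out code.
subprocess.run([sys.executable, "-m", "xl2times", *args], check=True)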

with ProcessPoolExecutor(max_workers=max_workers) as executor:
    results = list(executor.map(run_a_benchmark, benchmarks))
if debug:
    # bypass process pool and call benchmarks directly if --debug is set.
Collaborator:
Would love to have this documented in the CLI help for --debug, thanks!
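
For context, the behaviour under discussion is roughly the following. Only run_a_benchmark, max_workers, and the --debug flag appear in the diff, so the wrapper function and its exact structure are an assumption:

from concurrent.futures import ProcessPoolExecutor


def run_all(benchmarks, max_workers: int, debug: bool = False) -> list:
    # run_a_benchmark is defined elsewhere in run_benchmarks.py.
    if debug:
        # Bypass the process pool so breakpoints and debuggers work normally
        # and stack traces point straight at the failing benchmark.
        return [run_a_benchmark(b) for b in benchmarks]
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(run_a_benchmark, benchmarks))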

)
matches = [
    df.iloc[:, 0].to_list() if df is not None and len(df) != 0 else None
    for df in matches
Collaborator:
Just curious about this code: is matches a list of DataFrames at the end of the previous statement? Perhaps the FIXME can be resolved if we change the unique_filters.apply(lambda row: ... line to something like:

matches = [matcher(row, dictionary) for _index, row in unique_filters.iterrows()]
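
Put together with the follow-up comprehension from the diff, the suggestion would read roughly like this. matcher, dictionary, and unique_filters are the names used in the PR; this is a sketch of the reviewer's suggestion, not the merged code:

# One matcher call per row, collected into a plain Python list.
matches = [matcher(row, dictionary) for _index, row in unique_filters.iterrows()]

# Reduce each result to a list of matched names, or None if nothing matched.
matches = [
    df.iloc[:, 0].to_list() if df is not None and len(df) != 0 else None
    for df in matches
]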

@SamRWest merged commit e9c9fd3 into main on Feb 23, 2024
1 check passed
@SamRWest deleted the feature/wildcard_speedup branch on February 23, 2024 at 02:22
olejandro pushed a commit that referenced this pull request Feb 23, 2024
Now I remember why I used subprocess to call `xl2times` from `run_benchmarks.py`: in the CI, when it switches to the main branch, it needs to run the main branch's version of the tool. But if we call the tool as a Python function, I think some of the PR version of the tool remains in memory, and we don't get what we want. This looks to be the reason CI is failing on PR #183:

https://github.com/etsap-TIMES/xl2times/actions/runs/8020431584/job/21910275023

(The error is that it can't find a file in `xl2times/config/...` that was added by the PR when it is running tests on the main branch.)

This PR undoes the change from #193 that removed the use of subprocess.
See also
https://github.com/etsap-TIMES/xl2times/pull/193/files?diff=unified&w=0#r1498683705
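
A small illustration of the staleness problem described above (hypothetical scenario; only the package name is real):

import importlib

import xl2times

# Suppose the CI has just switched the working tree from the PR branch back
# to main. This interpreter already imported the PR-branch code, so that is
# what keeps running. Even an explicit reload only refreshes the top-level
# module; already-imported submodules and files read at import time may
# still reflect the PR branch. A fresh subprocess avoids the issue entirely.
importlib.reload(xl2times)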